Graph Mining Algorithms for Memory Leak Diagnosis and Biological Database Clustering
Authors
Abstract
Large graph-based datasets are common to many applications because of the additional structure provided to data by graphs. Patterns extracted from graphs must adhere to these structural properties, making them a more complex class of patterns to identify. The role of graph mining is to efficiently extract these patterns and quantify their significance. In this thesis, we focus on two application domains and demonstrate the design of graph mining algorithms in these domains. First, we investigate the use of graph grammar mining as a tool for diagnosing potential memory leaks from Java heap dumps. Memory leaks occur when memory that is no longer in use fails to be reclaimed, resulting in significant slowdowns, exhaustion of available storage, and eventually application crashes. Analyzing the heap dump of a program is a common strategy used in memory leak diagnosis, but our work is the first to employ a graph mining approach to the problem. Memory leaks accumulate in the heap as classes of subgraphs and the allocation paths from which they emanate can be explored to contextualize the leak source. We show that it suffices to mine the dominator tree of the heap dump, which is significantly smaller than the underlying graph. We demonstrate several synthetic as well as real-world examples of heap dumps for which our approach provides more insight into the problem than state-of-the-art tools such as Eclipse's MAT. Second, we study the problem of multipartite graph clustering as an approach to database summarization on an integrated biological database. Construction of such databases has become a common theme in biological research, where heterogeneous data is consolidated into a single, centralized repository that provides a structured forum for data analysis. We present an efficient approximation algorithm for identifying clusters that form multipartite cliques spanning multiple database tables. 
We show that our algorithm computes a lossless compression of the database by summarizing it into a reduced set of biologically meaningful clusters. Our algorithm is applied to data from Caenorhabditis elegans, but we note its applicability to general relational databases.

… played instrumental roles in its success. I would like to thank Dr. Back for providing expertise and domain knowledge regarding memory leaks, the Java runtime environment, and other software engineering concepts, as well as for his significant contributions to the joint work and the composition of the content in Chapter 3. I would like to thank Dr. Heath for offering his …
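The abstract's claim that mining the dominator tree of a heap dump suffices can be illustrated with a small sketch. The code below is not the thesis's implementation; it uses a simple iterative dataflow computation of dominators on a toy object graph, and the object names (`root`, `A`, `B`, `C`, `D`) and the function `immediate_dominators` are hypothetical. A node `d` dominates `n` if every path from the root to `n` passes through `d`; the dominator tree links each node to its closest strict dominator.

```python
from typing import Dict, Hashable, List, Optional, Set

Graph = Dict[Hashable, List[Hashable]]


def immediate_dominators(graph: Graph, root: Hashable) -> Dict[Hashable, Optional[Hashable]]:
    """Return the immediate dominator of every node reachable from root."""
    # Collect reachable nodes with an iterative depth-first search.
    order: List[Hashable] = []
    seen: Set[Hashable] = {root}
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        for succ in graph.get(node, []):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)

    # Build predecessor lists restricted to reachable nodes.
    preds: Dict[Hashable, List[Hashable]] = {n: [] for n in order}
    for n in order:
        for succ in graph.get(n, []):
            preds[succ].append(n)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom: Dict[Hashable, Set[Hashable]] = {n: set(order) for n in order}
    dom[root] = {root}
    changed = True
    while changed:
        changed = False
        for n in order:
            if n == root:
                continue
            new = set.intersection(*(dom[p] for p in preds[n])) | {n}
            if new != dom[n]:
                dom[n] = new
                changed = True

    # Strict dominators of a node form a chain; the immediate dominator is
    # the one closest to the node, i.e. the one with the largest dominator set.
    idom: Dict[Hashable, Optional[Hashable]] = {root: None}
    for n in order:
        if n != root:
            idom[n] = max(dom[n] - {n}, key=lambda d: len(dom[d]))
    return idom


# Toy "heap": C is reachable through both A and B, so only root dominates it.
heap: Graph = {"root": ["A", "B"], "A": ["C"], "B": ["C"], "C": ["D"], "D": []}
idom = immediate_dominators(heap, "root")
print(idom)
```

In this toy graph the five reference edges collapse to four tree edges, since the shared object `C` hangs directly under `root`. Production tools typically use faster algorithms (for example Lengauer-Tarjan) on real heap dumps; the set-based version here is chosen only for readability.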
Related resources
Document Clustering Based on Ontology and a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, one of the important techniques of text mining, is the unsupervised classification of similar documents into different groups. The most important step...
Comparison of the Efficiency of Data Mining Algorithms in Predicting the Diagnosis of Diabetes
Background: Diabetes is one of the major health problems in Iran and about 4.6 million adults suffer from this disease. Poor diagnosis of this disease has caused half of this number to be unaware of their disease. In recent years, along with the use of computers in data analysis and storage, the volume and complexity of data has increased dramatically. Methods: In health organizations, data pl...
Classification and Analysis of Frequent Subgraphs Mining Algorithms
In recent years, data mining in graphs, or graph mining, has attracted much attention due to the explosive growth in the generation of graph databases. A graph database consists of either a single large graph or a number of relatively small graphs. Applications that produce graph databases include biological networks, the semantic web, and behavioral modeling. Among al...
A New Knowledge-Based System for Diagnosis of Breast Cancer by a combination of the Affinity Propagation and Firefly Algorithms
Breast cancer has become a widespread disease among young women around the world. Expert systems developed with data mining techniques are valuable tools in the diagnosis of breast cancer and can support physicians in the decision-making process. This paper presents a new hybrid data mining approach to classify two groups of breast cancer patients (malignant and benign). The proposed approach, AP-AMBFA, con...
Analyzing and Predicting Large Vector-, Graph- and Spatio-Temporal Data
Large social graph datasets, pertaining to millions of social network users and the billions of relationships between them; complex, high-dimensional vector data of large database systems; and petabytes of environmental sensor data are being generated every day. Employing this flood of data for the benefit of all is one of the main challenges of the 21st century [129, 88, 51]. This thesis advan...
Publication date: 2010